1 Tutorial goals and set up

1.1 Objectives

Our main objectives are to handle some of the common data problems encountered with marine movement and diving data, using a narwhal dataset provided by Dr. Marianne Marcoux. Specifically, we aim to address the following:

  • Irregular locations

  • Time gaps

  • Including diving covariates

2 Loading the required packages

First, we’ll set up the workspace with the required packages.

library(momentuHMM)
library(dplyr)
library(tidyr)
library(lubridate)
library(adehabitatLT)
library(sf)
library(tmap)
library(terra)
library(units)
library(stringr)
library(diveMove)
library(conicfit)
library(car)
library(mitools)
library(doFuture)
library(corrplot)  # Pearson's correlation matrix using cor()

Please also set the working directory to “Day2” of the HMM workshop:

setwd("Day2")

3 Import data and initial data processing

For simplicity, we will only look at data from a one-week period in August 2017.

tracks <- read.csv("data/tracks.csv") %>%
  filter(!is.na(x) & !is.na(y)) %>% # remove missing locations
  mutate(
    time = ymd_hms(time), # define time
    loc_class = factor(loc_class, # define location class factor levels
      levels = c("GPS", 3, 2, 1, 0, "A", "B"))) %>% 
  # remove identical records
  filter(!(time == lag(time) & x == lag(x) & y == lag(y) & loc_class == lag(loc_class)),
  # filter a one-week period in August
  month(time) == 8, day(time) > 7,  day(time) <= 14)

The classic HMM assumes the observations occur at regular, discrete time intervals and that there are no missing values in the predictor variables.

# Calculate time difference between locations
tracks <- tracks %>%
  mutate(dt = ifelse(ID == lead(ID), # If next data row is same individual
    difftime(lead(time), time, units = "mins"), NA
  )) # calculate time difference
# Look at the frequency of time differences
hist(tracks$dt, 1000, main = NA, xlab = "Time difference (min)")

Next, we’ll convert the data to a spatial dataset using the sf package and plot the data. First, we define the coordinate reference system of the original data (in this case WGS84, which is defined by the EPSG code 4326). Next, we will project the data into NAD83(CSRS) UTM zone 21N (EPSG:2962), which projects the coordinates into meter units with minimal distortion for this data set.

tracks <- tracks %>%
  st_as_sf(coords = c("x", "y")) %>% # converts to an sf object
  st_set_crs(4326) %>% # define CRS
  st_transform(2962) # reproject data to a UTM

For the first part of this tutorial, we’ll use only the fastloc GPS data so we don’t have to deal with location error.

# filter GPS locations only
tracks_gps <- tracks %>% 
  filter(loc_class=="GPS")

We lose some data, particularly near the end of the tracks, but we will integrate ARGOS locations later in this tutorial.

Now, we can map the data using the tmap package to visualize what it looks like.

# plot GPS
tracks_gps %>% 
  group_by(ID) %>% 
  summarise(do_union = F) %>% 
  st_cast("LINESTRING") %>% 
     tm_shape() +
   tm_lines(col = "ID", palette = "Dark2")

3.1 Selecting a time interval for the HMM

There are two key decisions we must make: the temporal resolution to use, and how to address data gaps. The desired resolution depends predominantly on the biological question you are asking, as different behaviours and biological processes occur at different spatial and temporal scales (e.g., seasonal migration, daily movement between foraging and resting grounds, and fine-scale foraging decisions). Generally, higher-resolution data are preferred as they contain more information; however, it is possible to have too high a resolution, wherein fine-scale variability drowns out the signal from the coarse-scale patterns of interest (e.g., seasonal migration). In this case, we will be linking the movement data with high-resolution (75 s) dive data to look at finer-scale behaviours (on the order of a few hours). My rule of thumb is that you want 3-50 data points per behaviour. For behaviours spanning several hours, that roughly corresponds to a desired resolution between 2 min and 60 min.
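As a quick sanity check, the rule of thumb above can be turned into rough resolution bounds; the 3 h behaviour duration below is just an illustrative value, not from the narwhal data:

```r
# Rough resolution bounds from the 3-50 points-per-behaviour rule of thumb
behaviour_duration <- 180      # an example behaviour lasting ~3 h, in minutes
behaviour_duration / c(50, 3)  # candidate resolutions: ~3.6 min to 60 min
```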

Let’s see what resolutions may be possible in the data by looking at the most frequent time gaps.

# identify the most frequent dt
tracks_gps %>% 
  {table(.$dt)} %>% 
  sort(decreasing = TRUE) %>% 
  head()
## 
##  10  11   1   9  12   2 
## 299 109 101  75  72  56
# Visualize time differences
hist(tracks_gps$dt, 1000, main = NA, xlab = "Time difference (min)")

# Zoom in on short intervals
hist(tracks_gps$dt, 1000, main = NA, xlab = "Time difference (min)", xlim = c(0,100))

We see that the most frequent time gap is 10 min, followed by 11, 1, 9, and 12 min. We also see the majority of the gaps are < 60 min, however some are in excess of 600 min. Although it may be possible to use 1 or 2 min resolution data, we will have to address the data gaps. Frequent and large data gaps can be difficult to handle, especially as the number of missing data points approaches or exceeds the existing data; we really want to avoid this. Let’s examine the potential data structure at different resolutions for the different animals.

We first create a function that can calculate the proportion of missing locations we would have for a given resolution.

# Make function to estimate proportion of missing location 
p_na <- function(t_0, t_max, n_loc, resolution) {
  max_n_loc <- length(seq(t_0, t_max, by = as.difftime(resolution, units = "mins")))
  n_NA <- max_n_loc - n_loc
  n_NA / max_n_loc
}
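As a quick check of p_na() (the time-stamp below is arbitrary): one hour of data at a 10 min resolution yields seven potential locations (0, 10, ..., 60 min), so observing five of them leaves 2/7 missing:

```r
# Sanity check of p_na() on a toy example
t0 <- as.POSIXct("2017-08-08 00:00:00", tz = "UTC")
p_na(t0, t0 + 3600, n_loc = 5, resolution = 10)  # 2/7, i.e. ~0.29
```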

We can now use this function to look at the proportion of NAs we would get with 1, 2, 5, 10, and 20 min resolution.

# summarise tracks
track_summary <-
    tracks_gps %>% st_drop_geometry() %>% 
  # limit to core period with data
  group_by(ID) %>% 
    filter(dt >= 1) %>% # remove duplicate time (just for this stage)
    summarise(n_loc = n(), # number of locations
              p_NA_1m = p_na(first(time), last(time), n_loc, 1),  # 1 min 
              p_NA_2m = p_na(first(time), last(time), n_loc, 2),  # 2 min 
              p_NA_5m = p_na(first(time), last(time), n_loc, 5),  # 5 min 
              p_NA_10m =p_na(first(time), last(time), n_loc, 10),  # 10 min 
              p_NA_20m =p_na(first(time), last(time), n_loc, 20))  # 20 min 
track_summary
## # A tibble: 3 × 7
##   ID      n_loc p_NA_1m p_NA_2m p_NA_5m p_NA_10m p_NA_20m
##   <chr>   <int>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
## 1 T172062   485   0.950   0.900   0.751    0.501  0.00206
## 2 T172064   479   0.951   0.902   0.754    0.508  0.0164 
## 3 T172066   491   0.949   0.899   0.747    0.493 -0.0124

Here we see that despite the large number of 1 min time steps, at that resolution > 95% of the potential locations are missing. Even at the 10 min interval, > 50% of the locations would be missing. (The slightly negative value for T172066 at 20 min simply means there were more observed locations than 20 min slots.) Very large data gaps, which contribute much of the missing data, can be excluded from the analysis; therefore, for this tutorial, I will use a 10 min resolution as a compromise between high-resolution information and good temporal coverage.

3.2 Handling data gaps

There are several ways to deal with data gaps, and I will address four:

  1. Interpolation (linear or statistical)

  2. Nullification

  3. Path segmentation

  4. Multiple imputation

3.2.1 Approach 1. Interpolation

For large datasets with few and small gaps, the simplest approach is to use linear interpolation. First, let’s identify the minute on which locations most frequently fall.

# which minute has the most data
tracks_gps %>% 
  st_drop_geometry() %>%  # must convert back to data.frame
  group_by(ID) %>% 
  summarise(minute = str_sub(minute(time),-1)) %>% 
  table()
## `summarise()` has grouped output by 'ID'. You can override using the `.groups`
## argument.
##          minute
## ID          0   1   2   3   4   5   6   7   8   9
##   T172062  81  78  62  52  37  40  40  45  37  32
##   T172064  77  68  71  67  32  32  40  32  31  38
##   T172066 138  52  73  48  44  42  37  27  22  18

It looks like for all three tracks, the largest number of locations falls on minute 0 (i.e., minutes ending in 0: 0, 10, 20, 30, 40, 50). Next, for each track, we will create a vector of times at which to estimate locations.

# convert tracks back to data.frame with xy coordinates
tracks_gps_ls <- tracks_gps %>% 
  mutate(x = st_coordinates(tracks_gps)[,"X"],  
         y = st_coordinates(tracks_gps)[,"Y"]) %>%
  st_drop_geometry() %>% 
    split(.,.$ID)  # split into list

# create full time series on which to estimate locations rounded to the nearest 10 min
tracks_gps_ls_time <- tracks_gps %>% 
  st_drop_geometry() %>%  # convert to data.frame
  group_by(ID) %>%
  summarise(time = seq(round_date(first(time), "10 min"), 
                       round_date(last(time), "10 min"),
                       by = 60*10)) %>% 
  split(.,.$ID)  # split into list
## `summarise()` has grouped output by 'ID'. You can override using the `.groups`
## argument.

Now, we can interpolate the locations for each track.

# function to create a data frame with approximated locations
approx_locs <- function(tracks, times){
  data.frame(ID = times$ID,
        time = times$time,
        x = approx(tracks$time, tracks$x,
                   xout = times$time)$y,
        y = approx(tracks$time, tracks$y,                               
                   xout = times$time)$y)
}

# Interpolate the location at the times from the sequence
tracks_gps_linear <- mapply(approx_locs, tracks_gps_ls, tracks_gps_ls_time,
                          SIMPLIFY = F) %>% 
  do.call("rbind", .)  # convert list of tracks back to a single data.frame

# remove row names added by rbind
rownames(tracks_gps_linear) <- NULL 

# plot locations
plot(tracks_gps_linear$x, tracks_gps_linear$y, pch = 20, col = "red", xlab = "x", ylab = "y", asp = 1)
points(st_coordinates(tracks_gps)[,"X"], st_coordinates(tracks_gps)[,"Y"], pch = 20)

Looks like it works. Let’s try fitting an HMM to this. First, let’s prepare the data using prepData and plot the data to estimate starting parameters.

prep_gps_linear <- prepData(tracks_gps_linear, type = "UTM")

plot(prep_gps_linear, ask = F)

# Setting up the starting values
mu0 <- c(50, 500) # Mean step length
sigma0 <- c(50, 500) # Sd of the step length
kappa0 <- c(0.1, 1) # Turning angle concentration parameter (kappa > 0)

OK, we are ready. Let’s fit the HMM.

set.seed(1)
# Fit a 2 state HMM
HMM_gps_linear <- fitHMM(prep_gps_linear, nbStates = 2,
  dist = list(step = "gamma", angle = "vm"),
  Par0 = list(step = c(mu0, sigma0), angle = kappa0), formula = ~ 1)

plot(HMM_gps_linear, ask = FALSE)
## Decoding state sequence... DONE

That model does not look good. It’s clear from the mapped states that state 1 largely represents the original data, while state 2 primarily represents the interpolated data. This is a common problem when data gaps are frequent or large, such that the information in the interpolated data outweighs the signal from the original observations. Generally, I would only use linear interpolation when data gaps are small (< 3 locations) or relatively infrequent (< 20% of the modelled locations). In our data, some gaps are > 5 hours (30 locations) and > 50% of the modelled locations are interpolated. So we need to use another approach.

A slightly better way to interpolate locations is to fit a correlated random walk (CRW) model to the data and predict the most likely locations. momentuHMM contains wrapper functions to interpolate missing locations by fitting a continuous-time CRW to the data based on the crawl package by Devin Johnson and Josh London. There are many options for fitting the crawl model, and a detailed tutorial for analysis with crawl is available here: https://jmlondon.github.io/crawl-workshop/crawl-practical.html. Let’s try to fit the most basic model using the wrapper function crawlWrap.

set.seed(1) # crawl often fails, so I recommend always setting a seed 
# fit crawl
crw_gps_10 <- crawlWrap(obsData = tracks_gps, timeStep = "10 mins")
## Fitting 3 track(s) using crawl::crwMLE...
## crawl 2.2.1 (2018-09-12) 
##  Demos and documentation can be found at our new GitHub repository:
##  https://dsjohnson.github.io/crawl_examples/
## Loading required package: rngtools
## Individual T172062...
## Beginning SANN initialization ...
## Beginning likelihood optimization ...
## Individual T172064...
## Beginning SANN initialization ...
## Beginning likelihood optimization ...
## Individual T172066...
## Beginning SANN initialization ...
## Beginning likelihood optimization ...
## Warning in sqrt(diag(Cmat)): NaNs produced
## DONE
## Warning in eval(c.expr, envir = args, enclos = envir): crawl::crwMLE for
## individual T172066 has NaN variance estimate(s)
## Predicting locations (and uncertainty) at 10 mins time steps for 3 track(s) using crawl::crwPredict...
## Warning in nax != nay: longer object length is not a multiple of shorter object
## length
## Warning in nax != nay: longer object length is not a multiple of shorter object
## length

## Warning in nax != nay: longer object length is not a multiple of shorter object
## length
## DONE
# plot fit tracks
plot(crw_gps_10, ask = F)

Notice how the predicted tracks do not make perfectly straight lines through missing sections (particularly noticeable in T172062). Next, we will extract the predicted locations and add them to the observed data.

# filter predicted locations
tracks_gps_crw <- data.frame(crw_gps_10$crwPredict) %>% 
  filter(locType == "p") %>% 
  dplyr::select(mu.x, mu.y, time,
         ID) %>% 
  dplyr::rename(x = "mu.x", y = "mu.y")

Now, let’s try to fit the same HMM as before on this data using the same starting parameters.

set.seed(1)
# prep data 
prep_gps_crw <- prepData(tracks_gps_crw, type = "UTM")

# Setting up the starting values
mu0 <- c(50, 500) # Mean step length
sigma0 <- c(50, 500) # Sd of the step length
kappa0 <- c(0.1, 1) # Turning angle concentration parameter (kappa > 0)

# Fit a 2 state HMM
HMM_gps_crw <- fitHMM(prep_gps_crw, nbStates = 2,
  dist = list(step = "gamma", angle = "vm"),
  Par0 = list(step = c(mu0, sigma0), angle = kappa0))

plot(HMM_gps_crw, ask = F)
## Decoding state sequence... DONE

That’s looking much better. It looks like state 1 represents a low-speed, high-tortuosity resident state, while state 2 represents a higher-speed, low-tortuosity travelling state.

In many instances, this model may be sufficient. However, the significant proportion of interpolated locations used to fit the model is likely to affect our results. For example, the large interpolated gaps are still relatively straight and have a very consistent speed, which may skew the definition of state 2 in particular. Simply interpreting the results with this issue in mind can be adequate when only a few locations are interpolated. Below, we will explore more formal ways to address this problem, which can be particularly useful when a very large number of locations are interpolated.

3.2.2 Approach 2. Nullify data gaps

One strategy to address large data gaps is to nullify the data streams (i.e., step length and turning angle) during moderate/large interpolated gaps, where we expect that the estimated movement is largely independent of the observed data before or after the gap. The maximum gap size to allow depends on the frequency of the missing data, the frequency of locations, the study species, and the behaviours of interest. In this case, I will nullify predicted locations in gaps larger than 60 min. First, we will identify which steps need to be nullified, then we will prepare the data and nullify the estimated step and angle data streams. We will do this again later in the tutorial, so we will wrap it into a function called prepData_NAGaps.

# define function to replace step and turn angle of large gaps with NA
NA_gaps <- function(tracks, times){
  # rows where location is within a large gap
  rows <- which(
    rowSums(apply(times, 1, function(X, tracks){
      dplyr::between(tracks, 
                     as.POSIXct(X[1], tz = "UTC"),
                     as.POSIXct(X[2], tz = "UTC"))
    }, tracks$time))==1)
  tracks$step[rows] <- NA
  tracks$angle[rows] <- NA
  return(tracks)
}

# define function to identify and nullify gaps
prepData_NAGaps <- function(track_list, tracks_crw, res, max_gap, ...){
  # for each ID, identify which rows have gaps >= max_gap 
  gaps_ls_rows <- lapply(track_list, function(x){
    which(difftime(lead(x$time), x$time, units = "mins") >= max_gap)
  })
  
  # create sequence of times for each track from gaps >= max_gap
  gap_ls <- mapply(FUN = function(track, gaps){
    # identify start and end date of each gap
    gaps_ls_srt_end <- list(start = ceiling_date(track$time[gaps], paste(res, "min")),
                            end = floor_date(track$time[gaps+1], paste(res, "min")))
    # combine into single vector for each track
    data.frame(start = gaps_ls_srt_end$start, end = gaps_ls_srt_end$end)
  },
  track_list, gaps_ls_rows, SIMPLIFY = F)
  
  # prep data and list by ID
  prep_tracks <- prepData(tracks_crw, ...) %>% 
    {split(., .$ID)}
  
  # Interpolate the location at the times from the sequence
  mapply(FUN = NA_gaps, prep_tracks, gap_ls,
         SIMPLIFY = F) %>% 
    do.call("rbind", .) # convert list of tracks back to a single data.frame
}
  
prep_tracks_gps_crw_NAgaps <- prepData_NAGaps(tracks_gps_ls, tracks_gps_crw, 10, 60, type = "UTM")

Now, let’s try to fit the same HMM as above to this data with large gaps nullified.

set.seed(1)
# Setting up the starting values
mu0 <- c(50, 500) # Mean step length
sigma0 <- c(50, 500) # Sd of the step length
kappa0 <- c(0.1, 1) # Turning angle concentration parameter (kappa > 0)

# Fit a 2 state HMM
HMM_gps_crw_NAgaps <- fitHMM(prep_tracks_gps_crw_NAgaps, nbStates = 2,
  dist = list(step = "gamma", angle = "vm"),
  Par0 = list(step = c(mu0, sigma0), angle = kappa0))

plot(HMM_gps_crw_NAgaps, ask = FALSE)
## Decoding state sequence... DONE

Visually, the difference is subtle.

HMM_gps_crw$mle[c("step", "angle")]
## $step
##       state 1  state 2
## mean 258.1724 705.2750
## sd   158.3769 255.9088
## 
## $angle
##                state 1  state 2
## mean          0.000000  0.00000
## concentration 2.706698 23.60863
HMM_gps_crw_NAgaps$mle[c("step", "angle")]
## $step
##       state 1  state 2
## mean 282.3941 757.6624
## sd   156.3310 248.9599
## 
## $angle
##                state 1  state 2
## mean          0.000000  0.00000
## concentration 2.476524 19.03249

However, the estimated parameters differ noticeably depending on whether you account for the large gaps in the data. When you nullify large gaps, the mean step length for both states is higher, and the turning angle concentration parameter is lower for both states (i.e., more tortuous). The fact that the parameters change for both states suggests that the large gaps skewed the parameterisation of both states.
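To put numbers on that shift, we can compute the relative change in the mean step lengths reported above:

```r
# Relative increase in mean step length after nullifying large gaps,
# using the state means printed in the two model summaries above
(c(282.3941, 757.6624) - c(258.1724, 705.2750)) / c(258.1724, 705.2750)
# roughly a 9% (state 1) and 7% (state 2) increase
```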

3.2.3 Approach 3. Path segmentation

Another strategy to deal with larger gaps is to segment the tracks with a new individual ID. This may be particularly appropriate for gaps where we may reasonably expect that the underlying states are effectively independent of one another. Specifically, we may ask: over what period of time does the behaviour of the animal affect its subsequent behaviour? In this case, we may expect that the behaviour of a narwhal depends on its behaviour over the preceding several hours, but is independent after 24 hours. We can split the tracks at gaps larger than a predetermined threshold by iterating the ID column. We will not implement this approach in this tutorial; however, it can be done using the following code:

gap_thresh <- 3*60 # in minutes (3 h for illustration)
# segment tracks at gaps larger than gap_thresh
new_ID <- (tracks_gps$dt > gap_thresh | tracks_gps$ID != lag(tracks_gps$ID)) %>%  # if dif.time > gap_thresh or new ID
  replace_na(TRUE) %>%  # replace first NA with ID = 1
  cumsum() %>%  # iterate ID 
  paste(tracks_gps$ID, ., sep = "_")
# then smaller gaps <= gap_thresh can be interpolated with crawlWrap
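The cumsum() trick above can be hard to parse at first glance; here it is on a toy logical vector, where TRUE marks the start of a new segment:

```r
# TRUE flags rows that start a new segment (large gap or new individual)
new_seg <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
cumsum(new_seg)  # 1 1 1 2 2 -- a running segment counter
```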

3.3 Dealing with location error and irregular time

Thus far, we have been using exclusively the GPS data, which can be assumed to have negligible error in most applications. However, in many marine systems, we may obtain ARGOS locations, which can have highly variable location error on the order of hundreds of meters to hundreds of km. In addition, ARGOS data often provides location estimates at very irregular time intervals. crawlWrap can be used not only to impute missing data, but also to predict locations at regular time intervals when faced with high location error and irregular time series.

Let’s visualize the original data, which includes the Argos locations.

# convert to lines and plot using tmap
tracks %>% 
  group_by(ID) %>% 
  summarise(do_union = FALSE) %>% 
  st_cast("LINESTRING") %>% 
  tm_shape() + 
  tm_sf(col = "ID", palette = "Dark2")

It looks like there are at least several outlier points. We can address most of these using a coarse speed filter. However, before we can calculate speed (\(\frac{\Delta dist}{\Delta t}\)), we must address the duplicate time-stamps (since we cannot divide by a time difference of 0). It is often not possible to reliably identify which duplicate record is more accurate and should be discarded, so it is best to retain as much data as possible and offset duplicate times slightly. We will first offset duplicate times by adding 10 s for each consecutive row with dt == 0.
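One way to implement the offset (a sketch on toy data, not the tutorial’s exact code, and assuming the data are sorted by time within each individual) is to count each record’s position within a run of identical time-stamps and add 10 s per duplicate:

```r
# Toy example: three records share the same time-stamp
t <- as.POSIXct(c("2017-08-08 00:00:00", "2017-08-08 00:00:00",
                  "2017-08-08 00:00:00", "2017-08-08 00:05:00"), tz = "UTC")
# position within each run of identical times, minus 1 (0 for first record)
dup_n <- ave(seq_along(t), cumsum(!duplicated(t)), FUN = seq_along) - 1
t + 10 * dup_n  # 00:00:00, 00:00:10, 00:00:20, 00:05:00
```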